AITopics | text-conditional image generation

Cocktail: Mixing Multi-Modality Control for Text-Conditional Image Generation

Neural Information Processing SystemsDec-25-2025, 19:12:10 GMT

Text-conditional diffusion models are able to generate high-fidelity images with diverse contents.However, linguistic representations frequently exhibit ambiguous descriptions of the envisioned objective imagery, requiring the incorporation of additional control signals to bolster the efficacy of text-guided diffusion models. In this work, we propose Cocktail, a pipeline to mix various modalities into one embedding, amalgamated with a generalized ControlNet (gControlNet), a controllable normalisation (ControlNorm), and a spatial guidance sampling method, to actualize multi-modal and spatially-refined control for text-conditional diffusion models. Specifically, we introduce a hyper-network gControlNet, dedicated to the alignment and infusion of the control signals from disparate modalities into the pre-trained diffusion model.

control signal, diffusion model, mixing multi-modality control, (9 more...)

Neural Information Processing Systems

Country: Asia > China > Heilongjiang Province > Daqing (0.07)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

67d60c2694f4fecd18fa04d1fa8c0a5c-Supplemental-Conference.pdf

Neural Information Processing SystemsAug-15-2025, 12:54:43 GMT

clip-score 0, hq-transformer, image generation, (16 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.35)

Add feedback

Cocktail: Mixing Multi-Modality Control for Text-Conditional Image Generation

Neural Information Processing SystemsJan-18-2025, 23:18:53 GMT

Text-conditional diffusion models are able to generate high-fidelity images with diverse contents.However, linguistic representations frequently exhibit ambiguous descriptions of the envisioned objective imagery, requiring the incorporation of additional control signals to bolster the efficacy of text-guided diffusion models. In this work, we propose Cocktail, a pipeline to mix various modalities into one embedding, amalgamated with a generalized ControlNet (gControlNet), a controllable normalisation (ControlNorm), and a spatial guidance sampling method, to actualize multi-modal and spatially-refined control for text-conditional diffusion models. Specifically, we introduce a hyper-network gControlNet, dedicated to the alignment and infusion of the control signals from disparate modalities into the pre-trained diffusion model. The control signals are then fused and injected into the backbone model according to our proposed ControlNorm.Furthermore, our advanced spatial guidance sampling methodology proficiently incorporates the control signal into the designated region, thereby circumventing the manifestation of undesired objects within the generated image.We demonstrate the results of our method in controlling various modalities, proving high-quality synthesis and fidelity to multiple external signals.

control signal, diffusion model, text-conditional image generation, (7 more...)

Neural Information Processing Systems

Country: Asia > China > Heilongjiang Province > Daqing (0.08)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.40)

Add feedback

A Novel Sampling Scheme for Text- and Image-Conditional Image Synthesis in Quantized Latent Spaces

Rampas, Dominic, Pernias, Pablo, Aubreville, Marc

arXiv.org Artificial IntelligenceMay-23-2023

Recent advancements in the domain of text-to-image synthesis have culminated in a multitude of enhancements pertaining to quality, fidelity, and diversity. Contemporary techniques enable the generation of highly intricate visuals which rapidly approach near-photorealistic quality. Nevertheless, as progress is achieved, the complexity of these methodologies increases, consequently intensifying the comprehension barrier between individuals within the field and those external to it. In an endeavor to mitigate this disparity, we propose a streamlined approach for text-to-image generation, which encompasses both the training paradigm and the sampling process. Despite its remarkable simplicity, our method yields aesthetically pleasing images with few sampling iterations, allows for intriguing ways for conditioning the model, and imparts advantages absent in state-of-the-art techniques. To demonstrate the efficacy of this approach in achieving outcomes comparable to existing works, we have trained a one-billion parameter text-conditional model, which we refer to as "Paella". In the interest of fostering future exploration in this field, we have made our source code and models publicly accessible for the research community.

artificial intelligence, deep learning, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2211.07292

Country: